Testing the Tester: Lessons Learned 
During the Testing of a State-of-the-Art 
Commercial 14nm Processor Under 
Proton Irradiation 
Carl M. Szabo, Jr. 

AS&D, Inc., supporting 
NASA Electronic Parts and Packaging (NEPP) Program 
NASA/GSFC
carl.m.szabo@nasa.gov 
301-286-8890 
Adam R. Duncan, Naval Surface Warfare Center Crane, and 


Kenneth A. LaBel, NASA GSFC 


To be presented by Carl M. Szabo, Jr., at the 2017 Radiation Effects on Components and Systems (RADECS) Conference, Geneva, Switzerland, October 3, 2017.


Acronyms 


• Basic Input/Output System (BIOS)
• Device Under Test (DUT)
• Graphical Processing Unit (GPU)
• Goddard Space Flight Center (GSFC)
• High Definition Multimedia Interface (HDMI)
• Massachusetts General Hospital (MGH)
• National Aeronautics and Space Administration (NASA)
• NASA Electronic Parts and Packaging (NEPP) Program
• Random Access Memory (RAM)
• Solid State Disk (SSD)
• Single Event Effect (SEE)
• Thermal Design Power (TDP)
• Universal Serial Bus (USB)

Background


Circa 1986, playing with an Atari 1040ST 


• Lifelong Computer Hobbyist and Enthusiast
— Unconventional Training and Skill Set
• System Administrator supporting GSFC since 2002
— Duties often require flexibility and out-of-the-box thinking to solve unplanned problems / handle unexpected events
• Introduced to Radiation Effects ~2012
— “Person Under Test”

Device Under Test

• Intel Core i5-6600K “Skylake” Microprocessor


— ASUS Z170M-Plus Motherboard, 83GB RAM, 750W Power Supply, 
SSD, USB and HDMI over Ethernet control 


— Microsoft Windows Server 2012 R2 OS, HWiNFO System Monitoring, Linpack, FurMark, Prime95 Stress, Batch File Control
— In-situ, “System Level” Best Effort Approach

• Proton Testing via


— TRIUMF 105 MeV Beam Line (November 2015) 
— MGH 200 MeV Beam Line (October 2016) 

What Happened


• Hard failure event observed during TRIUMF visit
— Device appeared to lose integrated GPU functionality during irradiation
  • Failure occurred during “Full” test (Linpack + FurMark) with only 1 CPU core active
— Results were difficult to explain at the time of testing
• Subsequent testing at MGH yielded no functional failures after 60+ test runs!
— How??
• Next day at MGH, re-tested board used during TRIUMF tests
• Processors began to fail!

Troubleshooting 1/3


• Product was new and period of testing was short
— First setup featured early release hardware
— Public discovered flaws (Prime95 lockup issue)
• Test setup evolved as the device technology matured
— Later procured motherboards featured updates
  • BIOS revision of board used at MGH operated DUT differently than board and BIOS version used at TRIUMF
— Supporting hardware and software enabled enhanced data collection
  • Accurate data
— Evolution of test setup allowed insight that was not possible in early testing
— Retesting on the MGH and TRIUMF-tested boards showed same behaviors with fresh processors

Troubleshooting 2/3


• Large differences in functional parameters
— Failures only occurred during exposure to protons 
— These differences would likely be transparent to regular users 
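
One way to make such board-to-board differences visible is to reduce each board's logged parameters to a mean and compare them. The sketch below is illustrative only and is not the tooling used in these tests; the function name `compare_boards` and the parameter names are hypothetical, and it assumes each board's monitoring log has already been read into lists of numeric samples.

```python
from statistics import mean


def compare_boards(board_a, board_b):
    """Compare mean values of parameters logged on two boards.

    Each argument maps a parameter name (hypothetical, e.g. "package_power_w")
    to a list of numeric samples. Returns a dict mapping each shared parameter
    to the relative difference of the means, (mean_b - mean_a) / mean_a.
    Parameters whose mean on board A is zero are reported as None.
    """
    report = {}
    for name in sorted(board_a.keys() & board_b.keys()):
        mean_a = mean(board_a[name])
        mean_b = mean(board_b[name])
        report[name] = None if mean_a == 0 else (mean_b - mean_a) / mean_a
    return report
```

A relative difference is used so that parameters with different units (volts, watts, degrees) can be scanned side by side; any threshold for what counts as a "large" difference would have to be chosen by the tester.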

Troubleshooting 3/3


• Motherboard used at TRIUMF operated DUT in excess of rated 91 Watt TDP!
— Only 1 processing core active
— Degradation of performance after 18-hour extreme stress test
  • Failed Linpack tests
  • Could not reproduce GPU functional failure
• Motherboard used at MGH operated DUT more efficiently
— Lower temperature operation
— Fewer changes in voltage
— Slightly better performance
• Control Motherboard (latest BIOS available as of Sep. 2017)
— Behavior largely the same as MGH motherboard
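
A check for operation above the rated TDP can be sketched as follows. This is a minimal illustration, not the analysis performed for this work: the function name `tdp_exceedance` is hypothetical, and it assumes package power samples (in watts) have already been extracted from the monitoring log. Only the 91 W rating comes from the slide above.

```python
RATED_TDP_W = 91.0  # rated TDP of the DUT, as stated on the slide


def tdp_exceedance(power_samples_w, rated_w=RATED_TDP_W):
    """Summarize how often logged package power exceeded the rated TDP.

    power_samples_w: non-empty list of power readings in watts (assumed input).
    Returns (fraction_of_samples_above_rating, peak_power_w).
    """
    if not power_samples_w:
        raise ValueError("no power samples provided")
    above = sum(1 for p in power_samples_w if p > rated_w)
    return above / len(power_samples_w), max(power_samples_w)
```

Running the same check on logs from each motherboard and BIOS revision before beam time would flag a setup that drives the DUT outside its rating.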

Conclusion


• Early hardware and software are imperfect
— Perform updates: BIOS, microcode, hardware, and software
• Up-to-date hardware and software leads to
— Increased data
— Accurate data
  • Correctable / Uncorrectable Error Reporting
• However, current product cycle is changing quickly
— How feasible to characterize?
• Limited time begets limited reliability data
— Flight project cannot tolerate lack of supply + reliability data, nor frequent updates